You are allowed and encouraged to work with one partner on this project. Include your names, perm numbers, and whether you are taking the class for 131 or 231 credit.
You are welcome and encouraged to write up your report as a research paper (e.g. abstract, introduction, methods, results, conclusion) as long as you address each of the questions below. Alternatively, you can format the assignment like a long homework by addressing each question in parts.
There should be no raw R output in the body of the paper! All of your results should be formatted in a professional and visually appealing manner. That means, eather as a polished visualization or, for tabular data, a nicely formatted table (see the documentation for kable and kableExtra packages). If you feel you must include extensive raw R output, this should be included in an appendix, not the main report.
All R code should be available from your Rmarkdown file, but does not need to be shown in the body of the report! Use the chunk option echo=FALSE to exclude code from appearing in your writeup. In addition to your Rmarkdown file, you are required to submit the writuep as either a pdf document or an html file (both are acceptable).
The presidential election in 2012 did not come as a surprise. Some correctly predicted the outcome of the election correctly including Nate Silver, and many speculated his approach.
Despite the success in 2012, the 2016 presidential election came as a big surprise to many, and it was a clear example that even the current state-of-the-art technology can surprise us. Predicting voter behavior is complicated for many reasons despite the tremendous effort in collecting, analyzing, and understanding many available datasets. For our final project, we will analyze the 2016 presidential election dataset.
Answer the following questions in one paragraph for each.
What makes voter behavior prediction (and thus election forecasting) a hard problem?
What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?
What went wrong in 2016? What do you think should be done to make future predictions better?
election.raw = read.csv("data/election/election.csv") %>% as.tbl
census_meta = read.csv("data/census/metadata.csv", sep = ";") %>% as.tbl
census = read.csv("data/census/census.csv") %>% as.tbl
The meaning of each column in election.raw is clear except fips. The accronym is short for Federal Information Processing Standard.
In our dataset, fips values denote the area (US, state, or county) that each row of data represent. For example, a fips value of 6037 denotes Los Angeles County.
| county | fips | candidate | state | votes |
|---|---|---|---|---|
| Los Angeles County | 6037 | Hillary Clinton | CA | 2464364 |
| Los Angeles County | 6037 | Donald Trump | CA | 769743 |
| Los Angeles County | 6037 | Gary Johnson | CA | 88968 |
| Los Angeles County | 6037 | Jill Stein | CA | 76465 |
| Los Angeles County | 6037 | Gloria La Riva | CA | 21993 |
Some rows in election.raw are summary rows and these rows have county value of NA. There are two kinds of summary rows:
fips value of US.fips value.election.raw after removing rows with fips=2000. Provide a reason for excluding them. Please make sure to use the same name election.raw before and after removing those observations.election.raw <- election.raw %>%
filter(fips != "2000")
New dimensions: 5 columns x 18345 rows
Alaska has a fips value of 2000, so the rows where fips=2000 are indeed state-level summary of election results. However, the state-level summary rows of Alaska are already available when we read the data, so it makes no sense to have duplicate records.
Following is the first few rows of the census data:
| State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | Autauga | 1948 | 940 | 1008 | 0.9 | 87.4 | 7.7 | 0.3 | 0.6 | 0.0 | 1503 | 61838 | 11900 | 25713 | 4548 | 8.1 | 8.4 | 34.7 | 17.0 | 21.3 | 11.9 | 15.2 | 90.2 | 4.8 | 0 | 0.5 | 2.3 | 2.1 | 25.0 | 943 | 77.1 | 18.3 | 4.6 | 0 | 5.4 |
| Alabama | Autauga | 2156 | 1059 | 1097 | 0.8 | 40.4 | 53.3 | 0.0 | 2.3 | 0.0 | 1662 | 32303 | 13538 | 18021 | 2474 | 25.5 | 40.3 | 22.3 | 24.7 | 21.5 | 9.4 | 22.0 | 86.3 | 13.1 | 0 | 0.0 | 0.7 | 0.0 | 23.4 | 753 | 77.0 | 16.9 | 6.1 | 0 | 13.3 |
| Alabama | Autauga | 2968 | 1364 | 1604 | 0.0 | 74.5 | 18.6 | 0.5 | 1.4 | 0.3 | 2335 | 44922 | 5629 | 20689 | 2817 | 12.7 | 19.7 | 31.4 | 24.9 | 22.1 | 9.2 | 12.4 | 94.8 | 2.8 | 0 | 0.0 | 0.0 | 2.5 | 19.6 | 1373 | 64.1 | 23.6 | 12.3 | 0 | 6.2 |
| Alabama | Autauga | 4423 | 2172 | 2251 | 10.5 | 82.8 | 3.7 | 1.6 | 0.0 | 0.0 | 3306 | 54329 | 7003 | 24125 | 2870 | 2.1 | 1.6 | 27.0 | 20.8 | 27.0 | 8.7 | 16.4 | 86.6 | 9.1 | 0 | 0.0 | 2.6 | 1.6 | 25.3 | 1782 | 75.7 | 21.2 | 3.1 | 0 | 10.8 |
| Alabama | Autauga | 10763 | 4922 | 5841 | 0.7 | 68.5 | 24.8 | 0.0 | 3.8 | 0.0 | 7666 | 51965 | 6935 | 27526 | 2813 | 11.4 | 17.5 | 49.6 | 14.2 | 18.2 | 2.1 | 15.8 | 88.0 | 10.5 | 0 | 0.0 | 0.6 | 0.9 | 24.8 | 5037 | 67.1 | 27.6 | 5.3 | 0 | 4.2 |
| Alabama | Autauga | 3851 | 1787 | 2064 | 13.1 | 72.9 | 11.9 | 0.0 | 0.0 | 0.0 | 2642 | 63092 | 9585 | 30480 | 7550 | 14.4 | 21.9 | 24.2 | 17.5 | 35.4 | 7.9 | 14.9 | 82.7 | 6.9 | 0 | 0.0 | 6.0 | 4.5 | 19.8 | 1560 | 79.4 | 14.7 | 5.8 | 0 | 10.9 |
Column information is given in the metadata file.
Move summary rows from election.raw data into federal or state level summary files: i.e.,
Federal-level summary into a election_federal.
State-level summary into a election_state.
Only county-level data is to remain in election.
election_federal <- election.raw %>%
filter(fips == "US")
election_state <- election.raw %>%
filter(fips != "US" & is.na(county))
election <- election.raw %>%
filter(!is.na(county))
There were 31 candidates in the 2016 election.
order_candidates <- election_federal[order(election_federal$votes),]
candiates_total <- length(unique((election.raw$candidate)))
by_candidate <- election.raw %>%
group_by(candidate) %>%
summarise(total_votes = sum(votes))
candidates <- ggplot(election_federal, aes(x=reorder(candidate, -votes), y=votes)) +
geom_bar(stat="identity", fill = "#56B4E9") +
labs(title='Candidate National Votes', x = 'Candidate Name', y = 'Vote Count') +
theme(axis.text.x= element_text(angle=-75, size = 8, hjust = .2), panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank(), axis.line = element_line(colour = "black"), plot.title = element_text(hjust = 0.5))
candidates
county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (variable state_winner is similar).county_winner <- election %>%
group_by(fips) %>%
mutate(total = sum(votes)) %>%
mutate(pct = votes/total) %>%
top_n(n=1, pct)
state_winner <- election_state %>%
group_by(fips) %>%
mutate(total = sum(votes)) %>%
mutate(pct = votes/total) %>%
top_n(n=1, pct)
Visualization is crucial for gaining insight and intuition during the data mining process. To that end, we will generate cartographic representations (maps) of the states and counties, and map our data onto these representations.
The R package ggplot2 can be used to draw maps. Consider the following code.
states <- map_data("state")
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE) # color legend is unnecessary and takes too long
The variable states contains information to draw white polygons, while the fill-colors are determined by region.
counties = map_data("county") and color by county.counties = map_data("county")
ggplot(data = counties) +
geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE)
Now color the map by the winning candidate for each state. First, use left_join() to combine the states variable and the state_winner variable we created earlier. Note that left_join() needs to match up values of states to join the tables. A call to left_join() takes all the values from the first table and looks for matches in the second table. For each match, left_join() appends the data from the second table to the matching row in the first; if no matching value is found, it adds missing values:
Here, we’ll be combing the two datasets based on state name. However, the state names are in different formats in the two tables: e.g. AZ vs. arizona. Before using left_join(), create a common column by creating a new column for states named fips = state.abb[match(some_column, some_function(state.name))]. Replace some_column and some_function to complete creation of this new column. Then left_join(). Your figure will look similar to this state level New York Times map.
states$fips = state.abb[match(states$region, cbind(tolower(state.name), state.abb))]
state_winner$fips <- as.factor(state_winner$fips)
states <- left_join(states, state_winner, by = 'fips')
## Warning: Column `fips` joining character vector and factor, coercing into
## character vector
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE)
county does not have a fips value. So we will create one by pooling information from maps::county.fips. Split the polyname column to region and subregion. Use left_join() to combine county.fips into county. Also, left_join() previously created variable county_winner. Your figure will look similar to county-level New York Times map.county.fips <- maps::county.fips
county.fips$fips <- as.factor(county.fips$fips)
county.fips <- county.fips %>%
separate(polyname, c('region', 'subregion'), sep = ",")
county.fips = left_join(county.fips, county_winner, by= 'fips')
## Warning: Column `fips` joining factors with different levels, coercing to
## character vector
counties = left_join(counties, county.fips, by=c('region', 'subregion'))
ggplot(data = counties) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white", size = .1) +
coord_fixed(1.3) +
guides(fill=FALSE)
Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.
The census data contains high resolution information (more fine-grained than county-level).
In this problem, we aggregate the information into county-level data by computing TotalPop-weighted average of each attributes for each county. Create the following variables:
census.del: start with census, filter out any rows with missing values, convert {Men, Employed, Citizen} attributes to percentages (meta data seems to be inaccurate), compute Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, remove these variables after creating Minority, remove {Walk, PublicWork, Construction}.census.del = census[complete.cases(census),] %>%
mutate(Men = Men / TotalPop, Employed = Employed / TotalPop, Citizen = Citizen / TotalPop) %>%
mutate(Minority = Hispanic + Black + Native + Asian + Pacific) %>%
select(-Walk, -PublicWork, -Construction, -Hispanic, -Black, -Native, -Asian, -Pacific, -Women)
* _Sub-county census data, `census.subct`_:
start with `census.del` from above, `group_by()` two attributes {`State`, `County`},
use `add_tally()` to compute `CountyTotal`. Also, compute the weight by `TotalPop/CountyTotal`.
census.subct = census.del %>%
group_by(State,County) %>%
add_tally(TotalPop) %>%
rename(CountyTotal = n) %>%
mutate("Weight" = TotalPop/CountyTotal)%>% mutate_at(vars(c(Men:Citizen),c(Poverty:Minority)),funs(.*Weight)) #only include columns selected and then multiply all by the weight
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
* _County census data, `census.ct`_:
start with `census.subct`, use `summarize_at()` to compute the weighted sum.
census.ct = census.subct %>%
group_by(State, County) %>%
summarize_at(vars(c(Men:Minority)), funs(sum(.)))
* _Print the first few rows of `census.ct`_:
head(census.ct)
## # A tibble: 6 x 27
## # Groups: State [1]
## State County Men White Citizen Income IncomeErr IncomePerCap
## <fct> <fct> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 Alab… Autau… 0.484 75.8 0.737 6.00e5 96427 292643
## 2 Alab… Baldw… 0.488 83.1 0.757 1.51e6 287310 832126
## 3 Alab… Barbo… 0.538 46.2 0.769 2.91e5 53023 153941
## 4 Alab… Bibb 0.534 74.5 0.774 1.61e5 24571 75228
## 5 Alab… Blount 0.494 87.9 0.734 4.06e5 80281 181542
## 6 Alab… Bullo… 0.530 22.2 0.755 1.00e5 25534 53205
## # … with 19 more variables: IncomePerCapErr <int>, Poverty <dbl>,
## # ChildPoverty <dbl>, Professional <dbl>, Service <dbl>, Office <dbl>,
## # Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>,
## # OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>,
## # Employed <dbl>, PrivateWork <dbl>, SelfEmployed <dbl>,
## # FamilyWork <dbl>, Unemployment <dbl>, Minority <dbl>
ct.pc and subct.pc, for county and sub-county respectively. Discuss whether you chose to center and scale the features before running PCA and the reasons for your choice. What are the three features with the largest absolute values of the first principal component? Which features have opposite signs and what does that mean about the correaltion between these features?pr_ct = census.ct %>%
ungroup %>%
select(c(Men:Minority)) %>%
prcomp(scale = TRUE)
pr_subct = census.subct %>%
ungroup %>%
select(c(Men:Minority)) %>%
prcomp(scale = TRUE)
ct.pc = as.data.frame(pr_subct$rotation[,1:2])
ct.pc.abs = abs(ct.pc)
ct.pc.order = order(ct.pc.abs)
subct.pc = as.data.frame(pr_subct$rotation[,1:2])
subct.pc.abs = abs(ct.pc)
kable(subct.pc[order(abs(pr_subct$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "Subcounty PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 Ordered | PC2 | |
|---|---|---|
| Men | 0.2517602 | -0.0355148 |
| Citizen | 0.2516750 | -0.0371084 |
| PrivateWork | 0.2492251 | -0.0250022 |
| Drive | 0.2489331 | -0.0253583 |
| Office | 0.2449174 | -0.0239673 |
| Employed | 0.2442508 | -0.0589544 |
| Service | 0.2417955 | -0.0007899 |
| Professional | 0.2410704 | -0.0747356 |
| MeanCommute | 0.2401601 | -0.0235876 |
| White | 0.2344204 | -0.0658673 |
| Carpool | 0.2314233 | -0.0169159 |
| Production | 0.2280590 | 0.0096322 |
| Poverty | 0.2251172 | 0.0544654 |
| ChildPoverty | 0.2178755 | 0.0547904 |
| SelfEmployed | 0.2152096 | -0.0820949 |
| Unemployment | 0.2057795 | 0.0562520 |
| WorkAtHome | 0.1886831 | -0.0985581 |
| Minority | 0.1651724 | 0.0587213 |
| OtherTransp | 0.1294814 | -0.0088174 |
| FamilyWork | 0.1184700 | -0.0668117 |
| Transit | 0.0688494 | -0.0362329 |
| Income | -0.0539271 | -0.4942202 |
| IncomeErr | -0.0502902 | -0.4396261 |
| IncomePerCap | -0.0487308 | -0.5435088 |
| IncomePerCapErr | -0.0416104 | -0.4619181 |
kable(subct.pc[order(abs(pr_subct$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "Subcounty PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 | PC2 Ordered | |
|---|---|---|
| IncomePerCap | -0.0487308 | -0.5435088 |
| Income | -0.0539271 | -0.4942202 |
| IncomePerCapErr | -0.0416104 | -0.4619181 |
| IncomeErr | -0.0502902 | -0.4396261 |
| WorkAtHome | 0.1886831 | -0.0985581 |
| SelfEmployed | 0.2152096 | -0.0820949 |
| Professional | 0.2410704 | -0.0747356 |
| FamilyWork | 0.1184700 | -0.0668117 |
| White | 0.2344204 | -0.0658673 |
| Employed | 0.2442508 | -0.0589544 |
| Minority | 0.1651724 | 0.0587213 |
| Unemployment | 0.2057795 | 0.0562520 |
| ChildPoverty | 0.2178755 | 0.0547904 |
| Poverty | 0.2251172 | 0.0544654 |
| Citizen | 0.2516750 | -0.0371084 |
| Transit | 0.0688494 | -0.0362329 |
| Men | 0.2517602 | -0.0355148 |
| Drive | 0.2489331 | -0.0253583 |
| PrivateWork | 0.2492251 | -0.0250022 |
| Office | 0.2449174 | -0.0239673 |
| MeanCommute | 0.2401601 | -0.0235876 |
| Carpool | 0.2314233 | -0.0169159 |
| Production | 0.2280590 | 0.0096322 |
| OtherTransp | 0.1294814 | -0.0088174 |
| Service | 0.2417955 | -0.0007899 |
kable(ct.pc[order(abs(pr_ct$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "County PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 Ordered | PC2 | |
|---|---|---|
| Poverty | 0.2251172 | 0.0544654 |
| ChildPoverty | 0.2178755 | 0.0547904 |
| Unemployment | 0.2057795 | 0.0562520 |
| Employed | 0.2442508 | -0.0589544 |
| Minority | 0.1651724 | 0.0587213 |
| White | 0.2344204 | -0.0658673 |
| WorkAtHome | 0.1886831 | -0.0985581 |
| Professional | 0.2410704 | -0.0747356 |
| Service | 0.2417955 | -0.0007899 |
| SelfEmployed | 0.2152096 | -0.0820949 |
| MeanCommute | 0.2401601 | -0.0235876 |
| Drive | 0.2489331 | -0.0253583 |
| FamilyWork | 0.1184700 | -0.0668117 |
| Production | 0.2280590 | 0.0096322 |
| Carpool | 0.2314233 | -0.0169159 |
| Office | 0.2449174 | -0.0239673 |
| Citizen | 0.2516750 | -0.0371084 |
| Men | 0.2517602 | -0.0355148 |
| Income | -0.0539271 | -0.4942202 |
| IncomePerCap | -0.0487308 | -0.5435088 |
| OtherTransp | 0.1294814 | -0.0088174 |
| PrivateWork | 0.2492251 | -0.0250022 |
| IncomePerCapErr | -0.0416104 | -0.4619181 |
| IncomeErr | -0.0502902 | -0.4396261 |
| Transit | 0.0688494 | -0.0362329 |
kable(ct.pc[order(abs(pr_ct$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "County PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 | PC2 Ordered | |
|---|---|---|
| IncomePerCap | -0.0487308 | -0.5435088 |
| IncomePerCapErr | -0.0416104 | -0.4619181 |
| IncomeErr | -0.0502902 | -0.4396261 |
| Income | -0.0539271 | -0.4942202 |
| Transit | 0.0688494 | -0.0362329 |
| Citizen | 0.2516750 | -0.0371084 |
| Professional | 0.2410704 | -0.0747356 |
| White | 0.2344204 | -0.0658673 |
| Minority | 0.1651724 | 0.0587213 |
| Production | 0.2280590 | 0.0096322 |
| Office | 0.2449174 | -0.0239673 |
| SelfEmployed | 0.2152096 | -0.0820949 |
| Drive | 0.2489331 | -0.0253583 |
| MeanCommute | 0.2401601 | -0.0235876 |
| PrivateWork | 0.2492251 | -0.0250022 |
| Employed | 0.2442508 | -0.0589544 |
| Men | 0.2517602 | -0.0355148 |
| FamilyWork | 0.1184700 | -0.0668117 |
| OtherTransp | 0.1294814 | -0.0088174 |
| Carpool | 0.2314233 | -0.0169159 |
| ChildPoverty | 0.2178755 | 0.0547904 |
| Unemployment | 0.2057795 | 0.0562520 |
| Poverty | 0.2251172 | 0.0544654 |
| WorkAtHome | 0.1886831 | -0.0985581 |
| Service | 0.2417955 | -0.0007899 |
After the 7th Principal Component, 90 percent of the variance is explained at the sub-county level. After the 12th Principal Component, 90 percent of the variane is explained at the county level.
pc_var_subct = pr_subct$sdev^2
pve_subct = pc_var_subct/sum(pc_var_subct)
cumulative_pve_subct = cumsum(pve_subct)
min(which(cumulative_pve_subct > .9))
## [1] 7
pc_var_ct = pr_ct$sdev^2
pve_ct = pc_var_ct/sum(pc_var_ct)
cumulative_pve_ct = cumsum(pve_ct)
min(which(cumulative_pve_ct > .9))
## [1] 12
census.ct, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 5 principal components of ct.pc as inputs instead of the originald features. Compare and contrast the results. For both approaches investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate cluster? Comment on what you observe and discuss possible explanations for these observations.In order to train classification models, we need to combine county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. The following code makes the necessary changes to merge them into election.cl for classification.
tmpwinner <- county_winner %>% ungroup %>%
mutate(state = state.name[match(state, state.abb)]) %>% ## state abbreviations
mutate_at(vars(state, county), tolower) %>% ## to all lowercase
mutate(county = gsub(" county| columbia| city| parish", "", county)) ## remove suffixes
tmpcensus <- census.ct %>% mutate_at(vars(State, County), tolower)
election.cl <- tmpwinner %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
## save meta information
election.meta <- election.cl %>% select(c(county, fips, state, votes, pct, total))
## save predictors and class labels
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct, total))
Using the following code, partition data into 80% training and 20% testing:
set.seed(10)
n <- nrow(election.cl)
in.trn <- sample.int(n, 0.8*n)
trn.cl <- election.cl[ in.trn,]
tst.cl <- election.cl[-in.trn,]
Using the following code, define 10 cross-validation folds:
set.seed(20)
nfold <- 10
folds <- sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))
Using the following error rate function:
calc_error_rate = function(predicted.value, true.value){
return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso")
Decision tree: train a decision tree by cv.tree(). Prune the resulting tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to a records variable. Intepret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior in the US (remember the NYT infographic?)
Run a logistic regression to predict the winning candidate in each county. Save training and test errors to the records variable. What are the significant variables? Are these consistent with what you observed in the decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables. Did your particular county (from question 13) results match the predicted results?
You may notice that you get a warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner). This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. Use the cv.glmnet function from the glmnet library to run K-fold cross validation and select the best regularization parameter for the logistic regression under the LASSO penalty. Reminder: set alpha=1 to run LASSO. What are the non-zero coefficients in the LASSO regression for the optimal value of \(\lambda\)? How do they compare to the unpenalized logistic regression? Save training and test errors to the records variable.
Compute ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data. Display them on the same plot. Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?
Some possibilities for further exploration are:
Data preprocessing: we aggregated sub-county level data before performing classification. Would classification at the sub-county level before determining the winner perform better? What implicit assumptions are we making?
Exploring additional classification methods: KNN, LDA, QDA, SVM, random forest, boosting etc. (You may research and use methods beyond those covered in this course). How do these compare to logistic regression and the tree method?
Bootstrap: Perform boostrap to generate plots similar to ISLR Figure 4.10/4.11. Discuss the results.
Use linear regression models to predict the total vote for each candidate by county. Compare and contrast these results with the classification models. Which do you prefer and why? How might they complement one another?
Conduct an exploratory analysis of the “purple” counties– the counties which the models predict Clinton and Trump were roughly equally likely to win. What is it about these counties that make them hard to predict?
Instead of using the native attributes (the original features), we can use principal components to create new (and lower dimensional) sets of features with which to train a classification model. This sometimes improves classification performance. Compare classifiers trained on the original features with those trained on PCA features.